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AtiTosp^ieric  Coirputations  on  Highly  Parallel  MIMD  Conputers 

A  Fortran  program  representing  a  two-dimensional  vertical  plane 
model  of  the  atmosphere  (see  Appendix)  has  been  modified  to  make  use  of 
the  WT^HCLCTH  (Gottlieb  [80] )  parallel  processor  simulator.  Note  that 
the  original  serial  version  of  this  program,  developed  by  D.  Marchesin 
and  A.  Bayliss,  is  currently  in  use  at  CIMS. 

To  determine  the  suitability  of  this  code  for  execution  on  very 
large  scale  parallel  conputers  of  the  proposed  paracomputer  class  (see 
Schwartz  [80] )  we  analyzed  the  performance  of  the  following  three 
codes:  • - 

Variant  1.  The  original  code  (to  establish  a  control  run). 

Variant  2.  Code  1  as  modified  to  run  under  the  WASHCLOTH  parallel 
processor  simulator,  but  with  one  PE  (processing  element).  In  this 
second  code  version  all  "do  loops"  were  left  intact,  and  no 
synchronization  operations  or  other  overhead  needed  for  parallelization 
were  introduced. 

Variant  3.  Our  final,  fully  parallel  code,  largely  identical  to 
Code  2,  but  with  certain  "do  loops"  replaced  by  code  that  partitioned 
woric  among  multiple  PEs.  This  code  included  all  necessary 
synchronizaticai  statements. 
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A  two-din>ensional  "do  loop"  refexrencing  the  a(k,i)'s  is  then 
rewritten  in  the  following  form,  where  the  function  IRPN  returns  the 
ic3entification  number  of  the  current  PE: 

IP  =  IRPN(X)+1 
DO  1  L=1,NTKI 

J=IIM-(L-1)*N 

IF(J.GT.KD*ID)GO  TO  2 

I^€)(J) 

K=IN(J) 

A{K,I)=  

1  COTTINUE 

2  COOTINUE 

Several  analogous  "c3o  loops"  over  a  one-dimensional  horizontal 
mesh  occur  in  the  atmospheric  model  code.  These  involve  NTI 
(=[(ID-1)/N]+1)  iterations. 

Although,  in  our  experiments,  we  used  the  static  procedure 
described  above  for  assigning  PEs  to  mesh  points,  a  dynamic  procedure 
allowing  flexible  use  of  the  PEs  has  been  developed  {see  Kom  [81]). 
Rather  than  assigning  a  predetermined  set  of  mesh  points  to  each  PE, 
the  dynamic  procedure  determines  the  "next"  mesh  point  whenever  a  PE 
becomes  available.  This  procedure  would  be  the  preferred  approach  when 
seme  mesh  points  require  substantially  more  conputation  than  others. 
However,  in  the  atmospheric  model  we  considered,  the  amount  of  work  at 
each  mesh  point  is  approximately  the  same  and  the  simulated  PEs  all  run 
at  the  same  rate.  Thus  the  sinpler  static  work  allocation  scheme  is 
somewhat  more  efficient. 
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We  note  that  the  additional  oode  required  for  oorrputing  array 
indices  and  for  synchronization  slowed  code  3  down  by  13%  relative  to 
code  2.  The  loss  is  nearly  a  constant  for  the  different  meshes  so,  to 
calculate  the  efficiency  relative  to  a  sequential  machine,  our  values 
should  be  multiplied  by  0.87.  This  slowdown  could  be  made  much  smaller 
by  using  a  compiler  that  generates  in-line  code  for  our  synchronization 
primitives.  It  could  also  be  improved  by  assigning  several  (rather 
than  one)  mesh  points  at  a  time  to  each  PE. 

We  also  note  the  anomalously  low  efficiency  for  N  =  32,  40  and  48 
resulting  frcm  the  fact  that  these  values  of  N  are  not  divisors  of 
^3±her-  the  horizontal  mesh  size  (16)  or  the  two-dimensional  mesh  size 
(80).  This  loss  in  efficiency  is  caused  by  the  fact  that  .some  •  of  the 
PES  remain  idle  during  the  final  iteration  of  certain  of  the  "do  loops" 
described  above. 
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Category  1.  Instructions  executed  by  each  PE  independent  of  the 
number  of  PEs. 

Category  2.  Instructions  divided  airong  the  PEs.  These  are  the 
instructions  in  the  bodies  of  the  loops. 

Category  3.  Instructions  spent  waiting  for  synchronization.  In 
this  category  we  have  two  separate  cases: 

a)  For  one-dimensional  loops  of  size  ID  and  N  PEs,  if  N  >  ID,  then 
N  -  ID  processors  will  be  kept  waiting. 

b)  If  N  is  not  a  divisor  of  the  mesh  size  ID*KD,  then  the 
remainder  are  kept  waiting  during  the  final  iteration  of  each 
two-dimensional  loop. 

Let  the  number  of  instructions  corresponding  to  the  categories 
listed  above  be  denoted  by  CI,  C2,  C3A  and  C3B  respectively.  The 
number  of  instructions  executed  by  each  PE  can  then  be  estimated  by  the 
formula 

C(ID,KD,N)  =  CI  +  C2*ID/N  +  C3A*MAX(0,  (N-ID)/tQ)  +  C3B*M0D(ID*KD,N) 
Several  runs  were  made  in  order  to  determine  the  values  of  the  4 
constants,  giving  the  following  results: 

CI  =  3852,  C2  =  431,000,  C3A  =  1350,  C3B  =  140. 
This  formula  was  used  to  obtain  the  following  chart  of  projected 
efficiency  values.  (Subsequent  runs  with  different  numbers  of  PEs  and 
meshes  of  different  sizes  have  shewn  excellent  agreement  with  these 
predicted  efficiences) .  This  formula  was  used  to  obtain  the  following 
chart  of  projected  efficiency  values. 
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(2)  16  PE  Run. 

Numter  of  instructions  executed  -  8,661. 

Iqmber  of  fetches  of  private  variables  -  1,158;  stores  -  522. 

Number  of  fetches  of  public  variables  -   898;  stores  -  126. 

Number  of  Repadds  -  10. 

Public/private  variable  references  -  .616071 

Public  references/instructions  executed  -  .119501. 

(3)  48  PE  Run. 

Number  of  instructions  executed  -  3,784. 

Number  of  fetches  of  private  variables  -  486;  stores  -  230. 

Number  of  fetches  of  public  variables  -  409;  stores  -  62. 

Number  of  Repadds  -  10. 

Public/private  variable  references  -  .673184. 

Public  references /instruct ions  executed  -  .127378. 

We  note  that  the  number  of  instructions  executed  in  the  case  of  16 
PES  is  nearly  linear  compared  with  the  case  of  8  PEs,  and  sublinear  in 
the  case  of  48  PEs.  The  sublinearity  is  due  to  the  lower  efficiency 
(more  idle  PE  time)  in  the  48  PE  case  as  shown  above  in  the  chart  for 
results  of  the  5  x  16  mesh  run. 

D.  Overall  Conclusion. 

Since  the  serial  execution  of  the  atnosfiieric  model  is  dominated 
by  multi-dimensional  "do  loops"  in  which  the  quantities  conputed  at 
different  mesh  points  at  each  computational  step  are  independent  of 
each  other,  this  code  can  easily  be  adapted  for  parallel  processing. 
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APPENDIX 

The  atiTDspheric  model  program  (developed  by  D.  Marchesin  and  A. 
Bayliss  of  CIMS)  used  for  this  test  solves  the  dry  baroclinic  equations 
in  conservation  form  along  a  fixed  latitude  and  for  up  to  nine  pressure 
levels.  We  used  the  leap  frog  scheme  for  the  time  variable,  4th  order 
central  differences  for  the  horizontal  distance  variable,  and  2nd  order 
central  differences  for  the  vertical  distance  variable.  The  equations 
are  as  follows: 
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v^ere  -n   is  the  surface  pressure,  u  and  v  are  the  wind  components,  a     is 
the  vertical  velocity,  and  <^   is  the  geopotential  height. 


